Human head detection method, electronic device and storage medium

ABSTRACT

A method for detecting and tracking a human head in an image by an electronic device is disclosed. The method may include segmenting the image into one or more sub-images; inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions; outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image; mapping, through a second convolutional layer, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; mapping, through a regression layer, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.

RELATED APPLICATION

This application is a continuation application of the International PCT Application No. PCT/CN2018/070008, filed with the Chinese Patent Office on Jan. 2, 2018 and claims priority to Chinese Patent Application No. 2017100292446, filed with the Chinese Patent Office on Jan. 16, 2017 and entitled “HUMAN HEAD DETECTION METHOD AND APPARATUS”, which is incorporated herein by reference in its entirety.

FIELD OF THE TECHNOLOGY

This application relates to the technical field of image processing, and in particular, to a method, an electronic device and a storage medium for human head detection.

BACKGROUND OF THE DISCLOSURE

Human head detection refers to the detection of the head of a human body in an image, and a result of the human head detection has various applications, such as applications in the field of security. At present, human head detection is implemented mainly based on the shape and color of a human head. A specific process of the human head detection includes: first, binarizing image pixels, and then performing edge detection to acquire a substantially circular edge; then using circle detection to acquire a position and size of the circular edge; and then performing gray scale and size determination on a corresponding circular area in the original image to obtain a human head detection result.

However, the current human head detection relies on the assumption that the shape of the human head is circular. In fact, the shape of the human head is not strictly circular, and the head shapes of different persons also differ. As a result, during the current human head detection, some human heads are missed and the accuracy of the result of the human head detection is relatively low.

SUMMARY

According to various embodiments provided by this disclosure, methods, electronic devices and storage media are provided for implementing human head detection in images.

A human head detection method includes:

segmenting, by an electronic device, an image to be detected into one or more sub-images;

inputting, by the electronic device, each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;

mapping, by the electronic device through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;

mapping, by the electronic device through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and

filtering, by the electronic device according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.

An electronic device includes a memory and a processor, the memory storing a computer readable instruction, the computer readable instruction, when executed by the processor, causing the processor to perform the following steps:

segmenting an image to be detected into one or more sub-images;

inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;

mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;

mapping, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and

filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire the human head position detected in the image to be detected.

One or more non-volatile storage media storing a computer readable instruction are provided, the computer readable instruction, when executed by one or more processors, causing the one or more processors to perform the following steps:

segmenting an image to be detected into one or more sub-images;

inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;

mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;

mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and

filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire the human head position detected in the image to be detected.

Details of one or more embodiments of this application are put forward in the following accompanying drawings and descriptions. Other features, objectives, and advantages of this application become more obvious with reference to the specification, the accompanying drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions of the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. The accompanying drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other accompanying drawings according to these accompanying drawings without creative efforts.

FIG. 1 shows an application environment diagram of a human head detection method according to an embodiment.

FIG. 2 shows a schematic diagram of an internal structure of an electronic device according to an embodiment.

FIG. 3 shows a schematic flowchart of a human head detection method according to an embodiment.

FIG. 4 shows a schematic structural diagram of a convolutional neural network according to an embodiment.

FIG. 5 shows a schematic flowchart for converting a convolutional neural network for image classification to a convolutional neural network for human head detection.

FIG. 6 is a schematic flowchart for filtering human head positions according to confidence levels.

FIG. 7 is a schematic flowchart for implementing step 606 of FIG. 6.

FIG. 8 is a schematic flowchart of performing human head tracking and people counting frame by frame in a video.

FIG. 9 is a schematic flowchart for detecting a human head position in a current video frame near a human head position tracked in a previous video frame and continuing to track the human head when the tracking of a human head position is interrupted at the previous video frame.

FIG. 10 illustrates an application scenario for human head detection and tracking.

FIG. 11 is a schematic diagram of performing people counting by using two parallel lines according to an embodiment.

FIG. 12 is a structural block diagram of a human head detection apparatus according to an embodiment.

FIG. 13 is a structural block diagram of a human head detection apparatus according to another embodiment.

FIG. 14 is a structural block diagram of a human head detection result determining module according to an embodiment.

FIG. 15 is a structural block diagram of a human head detection apparatus according to still another embodiment.

FIG. 16 is a structural block diagram of a human head detection apparatus according to yet another embodiment.

DESCRIPTION OF EMBODIMENTS

To make the objectives, technical solutions, and advantages of this application clearer, the following disclosure further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that specific embodiments described herein are merely intended to explain this application and are not intended to limit this application.

While the disclosure herein specifically refers to human head detection in top view images, the underlying principle may be applied to detection of other objects in any type of images. For example, the systems and methods disclosed below may be applied to detection of motor vehicles in satellite images for monitoring traffic and the like.

FIG. 1 is an application environment diagram of a human head detection method according to an embodiment. Referring to FIG. 1, the human head detection method is applied to a human head detection system, which includes an electronic device 110 and a top view camera 120 connected to the electronic device 110. The top view camera 120 is configured to capture an image to be detected and send the image to be detected to the electronic device 110. The top view camera may be mounted on the top of a building, on a wall above the height of a person (or a predetermined height), or at a corner of the top of the building, so that the top view camera can capture images from a top view angle. The top view may be an orthographic top view or a top view at an oblique angle (alternatively referred to as a perspective top view).

In an embodiment, the electronic device 110 may be configured to segment an image to be detected into one or more sub-images; input each sub-image to a convolutional neural network trained according to training images having marked human head positions (or labeled with human heads), and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through at least one other convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; map, through at least one regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire human head positions detected in the image to be detected.

FIG. 2 is a schematic diagram of an internal structure of an electronic device according to an embodiment. Referring to FIG. 2, the electronic device includes a processor, a memory and a network interface which are connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the electronic device may store an operating system and computer readable instructions. When being executed, the computer readable instructions may cause the processor to perform a human head detection method. The processor of the electronic device may include a central processing unit and a graphics processing unit. The processor is configured to provide computing and control capabilities to support operation of the electronic device. The internal memory may store the computer readable instructions. When being executed by the processor, the computer readable instructions may cause the processor to perform a human head detection method. The network interface of the electronic device is configured to be connected to the top view camera. The electronic device may be implemented by an integrated electronic device or a cluster including multiple electronic devices. The electronic device may be a personal computer, a server or a dedicated human head detection device. Those having ordinary skill in the art can understand that the structure shown in FIG. 2 is only a block diagram of a part of the structure related to the solution of this application, and does not constitute a limitation on the electronic device to which the solution of this application is applied. The specific electronic device may include more or fewer components than those shown in the figure, or may combine some components, or have a different component arrangement.

FIG. 3 is a schematic flowchart of a human head detection method according to an embodiment. This embodiment is mainly illustrated by applying the method to the electronic device 110 in FIG. 1 and FIG. 2 above. Referring to FIG. 3, the human head detection method specifically includes the following steps:

S302: Segment an image to be detected into one or more sub-images.

The image to be detected is an image on which human head detection needs to be performed. The image to be detected may be a picture or a video frame in a video. The sub-images are images which are segmented from the image to be detected and have a size smaller than the image to be detected. All segmented sub-images may have the same size or different sizes.

Specifically, the electronic device may traverse a window of a fixed size in the image to be detected according to a transverse step size and a longitudinal step size, thereby segmenting sub-images having the same size as the window from the image to be detected during the traversal process. The segmented sub-images may be combined into the image to be detected.

In an embodiment, step S302 includes: segmenting the image to be detected into one or more sub-images of a fixed size, adjacent sub-images in the segmented sub-images having an overlapping part.

The adjacent sub-images refer to sub-images whose positions in the image to be detected are adjacent, and the adjacent sub-images may partially overlap. Specifically, the electronic device may traverse the window of a fixed size in the image to be detected according to a transverse step size smaller than the window width and a longitudinal step size smaller than the window height, to acquire one or more sub-images of the same size, and adjacent sub-images have an overlapping part.

In this embodiment, there is an overlapping part between the segmented adjacent sub-images, thereby ensuring that the adjacent sub-images have higher correlation, and improving accuracy of detecting a human head position from the image to be detected, particularly when a human head lies at a boundary between adjacent sub-images.
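As an illustration of the sliding-window segmentation described above, the following sketch traverses a fixed-size window with transverse and longitudinal step sizes smaller than the window dimensions, so that adjacent sub-images overlap. The window size and step sizes are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

def segment_into_subimages(image, win_h, win_w, step_y, step_x):
    """Traverse a fixed-size window over the image and collect sub-images.

    Step sizes smaller than the window height/width yield overlapping
    adjacent sub-images, as described above.
    """
    height, width = image.shape[:2]
    sub_images = []
    for top in range(0, max(height - win_h, 0) + 1, step_y):
        for left in range(0, max(width - win_w, 0) + 1, step_x):
            sub_images.append(image[top:top + win_h, left:left + win_w])
    return sub_images

# Example: 64x64 windows with 32-pixel steps give 50% overlap between neighbors.
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image to be detected
subs = segment_into_subimages(frame, 64, 64, 32, 32)
```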

S304: Input each sub-image to a convolutional neural network trained according to a set of training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image.

The Convolutional Neural Network (CNN) is an artificial intelligence neural network. The convolutional neural network includes a preprocessing layer having at least one convolutional layer and at least one pooling layer. The convolutional neural network used in this embodiment may be directly constructed, and may alternatively be acquired by reconstructing an existing convolutional neural network. A computational task in the convolutional neural network may be implemented by a central processing unit or a graphics processing unit. The time consumed by the central processing unit for human head detection is approximately on the order of seconds, and the time consumed by the graphics processing unit for human head detection may be reduced to the order of hundreds of milliseconds, thereby realizing real-time human head detection.

In the convolutional layer in the convolutional neural network, there are a plurality of feature maps, each feature map includes a plurality of neurons, and all neurons of the same feature map share one convolution kernel. The convolution kernel provides a weight of the corresponding neuron, and the convolution kernel represents a feature. The convolution kernel is generally initialized in the form of a random decimal matrix, and a proper convolution kernel will be learned during training of the network to represent a feature. The convolutional layer can reduce connections between various layers in the neural network, and in addition, a risk of overfitting is reduced.

Pooling may take two exemplary forms of implementation: mean pooling and max pooling. Pooling may be considered as a special convolution process. Convolution and pooling greatly simplify the complexity of the neural network and reduce the parameters of the neural network.

The training images having human heads therein may be pre-marked (or labeled) with human head positions. For example, human head positions in the training images may be manually marked or labeled, or may be marked or labeled using other automatic means. The training images having the marked human head positions and the image to be detected may be images captured in a similar scene, setting or background, thereby further improving the accuracy of human head detection. The training images having marked human head positions can be of the same size as or different sizes from the image to be detected.

In an embodiment, when the convolutional neural network is trained, a confidence level may be assigned to the human head position marked in the training image. The training image is segmented into one or more sub-images according to the same segmentation manner as that of the image to be detected. The segmented sub-images are separately input to the convolutional neural network, and the convolutional neural network outputs human head positions and confidence levels. A difference between the output head positions and the marked head position is calculated, and a difference between the corresponding confidence levels is calculated. According to the two differences, parameters of the convolutional neural network are adjusted. The training is continued until a termination condition is reached. The termination condition may be that each difference is less than a preset difference threshold, or that the number of iterations reaches a preset number of times.
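A minimal sketch of the training step described above is given below, assuming a PyTorch-style model that outputs a head-position quadruple and a confidence level per sub-image. The model, loss choices, and tensor shapes are illustrative assumptions rather than the exact training procedure of this disclosure.

```python
import torch
import torch.nn as nn

# Illustrative losses for the two differences described above: one between the
# output and marked head positions (quadruples), one between the output and
# assigned confidence levels.
position_loss_fn = nn.SmoothL1Loss()
confidence_loss_fn = nn.BCEWithLogitsLoss()

def training_step(model, optimizer, sub_images, target_boxes, target_conf):
    """One parameter update for a head-detection network (illustrative)."""
    optimizer.zero_grad()
    pred_boxes, pred_conf = model(sub_images)              # quadruples and confidences
    loss = (position_loss_fn(pred_boxes, target_boxes)
            + confidence_loss_fn(pred_conf, target_conf))  # combine the two differences
    loss.backward()
    optimizer.step()                                       # adjust network parameters
    return loss.item()
```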

The preprocessing layer is used above as a general term for the layers in the convolutional neural network other than the regression layer and the convolutional layer before the regression layer. The preprocessing layer may include at least one convolutional layer and at least one pooling layer. The preprocessing layer may include parallel convolutional layers, and data output by the parallel convolutional layers may be spliced and input to a next layer. The last layer in the preprocessing layer may be a convolutional layer or a pooling layer. The preprocessing layer may include multiple pairs of convolutional and pooling layers connected in tandem. The preprocessing layer may include additional rectifying layers.

S306: Map, through a convolutional layer after the preprocessing layer in the convolutional neural network, a first feature corresponding to each sub-image to a second feature corresponding to each sub-image.

A conventional convolutional neural network is generally used for classification, and the preprocessing layer in the convolutional neural network for classification is followed by a fully connected layer. The fully connected layer may map the first feature output by the preprocessing layer to probability data corresponding to each preset type (or class). Therefore, a type to which an input image belongs may be determined by the regression layer. In this embodiment, the convolutional neural network is used for human head detection rather than classification. As such, a convolutional layer is configured to replace the fully connected layer and to output the second feature for describing the sub-image features. There may be a plurality of second features corresponding to each sub-image.

S308: Map, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a confidence level corresponding to the human head position.

The human head position may be represented by a position of a rectangular box bounding a human head in the image. The position of the rectangular box may be represented by a quadruple. The quadruple may include a horizontal coordinate and a longitudinal coordinate of one vertex of the rectangular box and a width and a height of the rectangular box. Alternatively, the quadruple may include a horizontal coordinate and a longitudinal coordinate of each of two diagonal vertices of the rectangular box. The confidence levels output by the regression layer are in a one-to-one correspondence with the human head positions output by the regression layer, each indicating a probability that the corresponding rectangular box does correspond to a human head at the corresponding position in the image. The regression layer may use a support vector machine (SVM).

In an embodiment, step S308 includes: mapping, through the convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position. Specifically, the electronic device may directly map the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position through the same convolutional layer in the regression layer in the convolutional neural network.

In an embodiment, step S308 includes: mapping, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and mapping, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the output human head position.

For example, referring to FIG. 4, the sub-image yields 128 feature matrices (feature maps), each with a size M*N, through the preprocessing layer in the convolutional neural network. 128 is a preset value for the number of features and can be set as needed; M and N are determined by parameters of the preprocessing layer. The 128 feature matrices with the size M*N are input to the convolutional layer after the preprocessing layer. By performing convolution processing using a parameter matrix with a size 128*1024 in that convolutional layer, M*N feature vectors with a length 1024 are output. The M*N feature vectors with the length 1024 are input to the first convolutional layer in the regression layer, are convolved by a parameter matrix with a size 1024*4 in the first convolutional layer, and M*N quadruples representing the human head positions are output. The M*N feature vectors with the length 1024 are also input to the second convolutional layer in the regression layer, are convolved by a parameter vector with a size 1024*1 in the second convolutional layer, and M*N tuples indicating the confidence levels of the human head positions are output. The correspondence relationship between the human head positions and the confidence levels is embodied in the order of the output M*N quadruples and tuples.
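The dimensions in the example above can be realized with 1x1 convolutions: the 128-to-1024 mapping corresponds to the convolutional layer after the preprocessing layer, and the 1024-to-4 and 1024-to-1 mappings correspond to the two parallel convolutional layers in the regression layer. The following PyTorch sketch is one possible reading of that structure (the preprocessing backbone is omitted); it is illustrative rather than the exact network of this disclosure.

```python
import torch
import torch.nn as nn

class HeadRegressionLayers(nn.Module):
    """Maps 128 M*N feature maps to M*N head-position quadruples and confidences."""

    def __init__(self, in_channels=128, hidden_channels=1024):
        super().__init__()
        # Convolutional layer after the preprocessing layer (128 -> 1024 features
        # per spatial position), i.e. the parameter matrix of size 128*1024.
        self.feature_conv = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        # Regression layer: two parallel convolutional layers (1024*4 and 1024*1).
        self.position_conv = nn.Conv2d(hidden_channels, 4, kernel_size=1)
        self.confidence_conv = nn.Conv2d(hidden_channels, 1, kernel_size=1)

    def forward(self, feature_maps):                     # (batch, 128, M, N)
        second_features = torch.relu(self.feature_conv(feature_maps))
        positions = self.position_conv(second_features)                     # (batch, 4, M, N)
        confidences = torch.sigmoid(self.confidence_conv(second_features))  # (batch, 1, M, N)
        return positions, confidences
```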

S310: Filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, and acquire a human head position detected in the image to be detected.

Specifically, the electronic device may compare the confidence level of each human head position output by the convolutional neural network with a confidence level threshold, and filter out human head positions whose confidence levels are less than the confidence level threshold. The electronic device may further filter out, from the human head positions remaining after the confidence level threshold filtering, human head positions whose areas are smaller than a preset area. The electronic device may cluster the filtered human head positions to combine the plurality of human head positions of the same type into one combined human head position in the image to be detected, or select one of the plurality of human head positions clustered to the same type as the human head position in the image to be detected.

According to the foregoing human head detection method, the convolutional neural network is trained in advance based on the training images having the marked human head positions, and the convolutional neural network can automatically learn human head features. The trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and then filter, according to the confidence levels, to acquire the human head position in the image to be detected. The human head shape is learned rather than pre-assumed. As such, a missed detection caused by presuming the shape of the human head can be avoided, and the accuracy of the human head detection is improved. Moreover, in the convolutional neural network, the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe human head features in the sub-images. Therefore, the second features are directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with a new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.

In an embodiment, before step S302, the human head detection method further includes a step of converting the convolutional neural network for classification into a convolutional neural network for human head detection and training it. Referring to FIG. 5, this step includes the following steps:

S502: Convert a fully connected layer after the preprocessing layer and before the regression layer included in the convolutional neural network for classification to a convolutional layer.

A conventional convolutional neural network for classification is a trained convolutional neural network which can classify images input to the convolutional neural network, such as GoogleNet, VGGNET, or AlexNet. The convolutional neural network for classification includes the preprocessing layer, the fully connected layer, and the regression layer. The fully connected layer is configured to output second features corresponding to each preset type (or class) of the conventional classification application.

The sparse connection and weight sharing of the fully connected layer and the convolutional layer are different: each neuron of the fully connected layer is connected to all neurons of the preceding layer. However, both the convolutional layer and the fully connected layer acquire the input of the next layer by multiplying the output of the preceding layer by a parameter matrix. As such, the conventional fully connected layer can be converted to the convolutional layer by changing the arrangement manner of the parameters of the fully connected layer.
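Rearranging the fully connected parameters into a convolution kernel can be sketched as follows. The shapes used in the example are assumptions for illustration only.

```python
import torch.nn as nn

def fc_to_conv(fc_layer, in_channels, spatial_h, spatial_w):
    """Convert a fully connected layer into an equivalent convolutional layer.

    The fully connected weight matrix of shape (out_features, in_channels*h*w)
    is rearranged into a convolution kernel of shape
    (out_features, in_channels, h, w); the parameter values are unchanged.
    """
    out_features = fc_layer.out_features
    conv = nn.Conv2d(in_channels, out_features, kernel_size=(spatial_h, spatial_w))
    conv.weight.data = fc_layer.weight.data.view(
        out_features, in_channels, spatial_h, spatial_w)
    conv.bias.data = fc_layer.bias.data
    return conv

# Example: a fully connected layer taking flattened 128-channel 7x7 features
# becomes a 7x7 convolution producing 1024 feature channels.
fc = nn.Linear(128 * 7 * 7, 1024)
conv = fc_to_conv(fc, 128, 7, 7)
```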

S504: Replace the regression layer in the convolutional neural network for classification with a regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level.

In the conventional convolutional neural network for classification, the regression layer is configured to map the second features of each preset type output by the fully connected layer to a probability corresponding to each preset type, and determine, according to the mapped probability, a preset type to which the image belongs. For example, a preset type corresponding to a maximum probability is selected as the preset type to which the input image belongs.

In the convolutional neural network for human head detection of this disclosure, the regression layer is configured to map a preset number of second features output by the converted convolutional layer to the human head positions and the confidence levels corresponding to the human head positions. The regression layer may use a convolutional layer. The convolutional layer directly maps the second features to the human head positions and the confidence levels corresponding to the human head positions. The regression layer may also use two convolutional layers in parallel. One convolutional layer is configured to map the second features to the human head positions, and the other convolutional layer is configured to map the second features to the confidence levels corresponding to the mapped human head positions.

S506: Train the convolutional neural network including the preprocessing layer, the converted convolutional layer, and the replaced regression layer by using the training images having the marked human head positions.

The convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer is reconstructed and modified from the conventional convolutional neural network for classification applications. In one implementation, parameters of the preprocessing layer may be pre-trained. Then, for the reconstructed convolutional neural network, mainly the parameters in the converted convolutional layer and the replaced regression layer need to be trained. The training may also be a joint process. For example, the entire network may be trained, where the preprocessing layer parameters are initialized to their pre-trained values and retrained together with the rest of the network.

Specifically, when the reconstructed convolutional neural network is trained, the confidence level may be pre-assigned to the marked human head positions of the training image. The training image is segmented into one or more sub-images according to the same segmenting manner as that of the image to be detected. The segmented sub-images are respectively input to the convolutional neural network, and the human head positions and the confidence levels are output by the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer of the convolutional neural network. The difference between the output human head positions and the marked human head position is calculated, the difference between the corresponding confidence levels is calculated, and the parameters in the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer in the convolutional neural network are adjusted according to the two differences. The training is continued until a termination condition is reached. The termination condition may be that the difference is less than a preset difference, or that the number of iterations reaches a preset number of times.

In this embodiment, the training is performed after reconstructing a conventional convolutional neural network for classification into a convolutional neural network for human head detection. Since the reconstruction does not require a complete redesign of the neural network, the training duration can be reduced and the efficiency of human head detection is improved.

As shown in FIG. 6, in an embodiment, step S310 specifically includes the following steps:

S602: Screen, from the human head positions corresponding to the sub-images, to acquire a human head position corresponding to a confidence level greater than or equal to a confidence level threshold.

Specifically, the electronic device may form the human head positions respectively corresponding to the sub-images segmented from the image to be detected into a human head position set, traverse the human head position set, and compare the confidence levels of the traversed human head positions with the confidence level threshold. The human head positions having confidence levels lower than the confidence level threshold may be removed from the human head position set. The remaining human head positions in the human head position set after the traversal are the acquired human head positions whose corresponding confidence levels are greater than or equal to the confidence level threshold. The confidence level threshold may be set as needed, and may, for example, be valued from 0.5 to 0.99.

S604: Select or identify, from the human head positions corresponding to the sub-images, human head positions that intersect in the image to be detected with the human head positions screened in S602.

The intersection of human head positions means that the enclosed areas indicated by the respective human head positions have an intersection in the image to be detected. When the human head position is represented by the position of a rectangular box including the human head image, the intersection of the human head positions is the intersection of the corresponding rectangular boxes. Specifically, the electronic device may select a human head position intersecting with the acquired human head position in the image to be detected from the human head position set formed by the human head positions respectively corresponding to all the sub-images segmented from the image to be detected. The electronic device may also search for the intersecting human head positions from only the acquired human head positions.

S606: Determine, according to the acquired human head position and the identified human head position, the human head position detected in the image to be detected.

Specifically, the electronic device may classify the acquired human head positions and the selected human head positions. Each type includes at least one of the acquired human head positions, and also includes human head positions intersecting with the at least one human head position. The electronic device may combine the human head positions of each type into one human head position as a detected head position, or select one human head position from the human head positions of each type as the detected human head position.

In this embodiment, the accuracy of human head detection can be further improved by using the confidence levels and the position intersection as the basis for determining the human head position in the image to be detected.

As shown in FIG. 7, in an embodiment, step S606 specifically includes the following steps:

S702: Use the acquired human head position (from step S602) and the selected human head position (from step S604) as nodes in a bipartite graph, as a first group and a second group, respectively.

A bipartite graph is a graph in graph theory whose nodes can be divided into two groups such that every edge connects nodes in different groups, that is, all edges span the boundary between the two groups.

S704: Assign default and positive weights to the edges between the nodes in the bipartite graph.

There is an edge between each acquired human head position and the correspondingly selected intersecting human head position. The default and positive weight is a positive value, such as 1000.

S706: Reduce the assigned weights of the edges whose associated nodes represent intersecting human head positions.

Specifically, when the human head positions indicated by the nodes associated with an edge intersect, the electronic device may subtract a positive value less than the default and positive weight from the correspondingly assigned weight, and then divide the result by the default and positive weight to acquire an updated weight. If the default and positive weight is 1000, and the positive value less than the default and positive weight is 100, then the updated weight is (1000−100)/1000=0.9.

S708: Solve a maximum weight edge combination of the bipartite graph, and acquire the human head position detected in the image to be detected.

An edge combination in the bipartite graph is a set of edges that have no common nodes. If the weight sum of the edges of a particular edge combination is the largest among all the edge combinations of the bipartite graph, this particular edge combination is referred to as the maximum weight edge combination. The electronic device may traverse all edge combinations in the bipartite graph to find the maximum weight edge combination. The electronic device may also use the Kuhn-Munkres algorithm to solve the maximum weight edge combination of the bipartite graph. After the maximum weight edge combination is solved, the human head positions associated with the edges in the maximum weight edge combination can be used as the human head positions detected in the image to be detected.
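As a sketch of steps S702 to S708, a weight matrix can be built between the screened head positions and the selected intersecting head positions, and the maximum weight matching solved with the Kuhn-Munkres (Hungarian) algorithm, for example via scipy.optimize.linear_sum_assignment. The box representation, intersection test and returned result below are illustrative assumptions, not the exact procedure of this disclosure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres / Hungarian algorithm

DEFAULT_WEIGHT = 1000.0   # default and positive weight
PENALTY = 100.0           # positive value smaller than the default weight

def boxes_intersect(a, b):
    """a and b are quadruples (x, y, w, h); True if the rectangles overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def match_head_positions(screened_boxes, selected_boxes):
    """Build the bipartite weight matrix and solve the maximum weight matching."""
    weights = np.full((len(screened_boxes), len(selected_boxes)), DEFAULT_WEIGHT)
    for i, a in enumerate(screened_boxes):
        for j, b in enumerate(selected_boxes):
            if boxes_intersect(a, b):
                # Reduced weight, e.g. (1000 - 100) / 1000 = 0.9.
                weights[i, j] = (DEFAULT_WEIGHT - PENALTY) / DEFAULT_WEIGHT
    rows, cols = linear_sum_assignment(weights, maximize=True)
    # Head positions associated with the edges of the maximum weight matching
    # are kept as the detections in the image to be detected.
    return [screened_boxes[i] for i in rows]
```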

In this embodiment, since the intersecting human head positions may correspond to the same human head, the human head positions output by the convolutional neural network are mostly gathered near the actual human head position in the image to be detected. Therefore, the acquired human head positions (step S602, for example) and the selected human head positions (step S604, for example) are used as the nodes in the bipartite graph to construct the bipartite graph, and the weights of the corresponding edges of the intersecting human head positions are reduced. By solving the maximum weight edge combination, the detected human head positions in the image to be detected are acquired, and the human head detection can be performed more accurately.

In an embodiment, the image to be detected may be a video frame in a video, and the human head detection method further includes a step of performing human head tracking and performing people counting frame by frame. Referring to FIG. 8, the step of performing human head tracking and performing people counting frame by frame specifically includes the following steps:

S802: Perform human head tracking video frame by video frame according to the human head position detected in the image to be detected.

Specifically, after detecting the human head position in one video frame, the electronic device performs the human head tracking video frame by video frame by using the detected human head position as a starting point. The electronic device may specifically use a mean shift (average drift) tracking algorithm, an optical flow tracking algorithm, or a tracking-learning-detection (TLD) algorithm.

S804: Determine a moving direction and a positional relationship of the tracked human head position relative to a designated area.

The designated area refers to the area designated in the video frame. The moving direction of the tracked human head position relative to the designated area refers to whether the human head position is, for example, moving toward or away from the designated area. The positional relationship of the tracked human head position relative to the designated area refers to whether the human head position is inside or outside the designated area.

In an embodiment, when the tracked human head position crosses a line representing a boundary of the designated area in a direction toward the designated area, it is determined that the tracked human head position enters the designated area. When the tracked human head position crosses the line representing the boundary of the designated area in a direction away from the designated area, it is determined that the tracked human head position leaves the designated area.

In an embodiment, when the tracked human head position sequentially crosses a first line and a second line parallel with the first line, it is determined that the tracked human head position enters the designated area. When the tracked human head position sequentially crosses the second line and the first line, it is determined that the tracked human head position leaves the designated area.

The parallel first line and second line may be straight lines or curved lines. The designated area may be the one of the two areas formed by segmenting the image to be detected by the second line that does not include the first line. In this embodiment, the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by the two lines, thereby preventing a judgment error caused by movement of the human head position in the vicinity of the boundary of the designated area and ensuring the correctness of people counting.
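A sketch of the two-line crossing test described above is given below, assuming two horizontal lines with the first line above the second and downward motion corresponding to entering the designated area; the per-track state dictionary is an illustrative assumption.

```python
def update_crossing_state(track_state, head_y, first_line_y, second_line_y):
    """Return "enter", "leave", or None for one tracked head position update.

    Entering: the head sequentially crosses the first line, then the second.
    Leaving: the head sequentially crosses the second line, then the first.
    """
    previous_y = track_state.get("last_y")
    track_state["last_y"] = head_y
    if previous_y is None:
        return None
    if previous_y < first_line_y <= head_y:            # crossed first line downward
        track_state["crossed_first"] = True
    if track_state.get("crossed_first") and previous_y < second_line_y <= head_y:
        return "enter"
    if previous_y > second_line_y >= head_y:           # crossed second line upward
        track_state["crossed_second"] = True
    if track_state.get("crossed_second") and previous_y > first_line_y >= head_y:
        return "leave"
    return None
```

The people counting of S806 can then increment the entering or leaving counters whenever "enter" or "leave" is returned for a tracked head position.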

S806: Perform people counting according to the determined moving direction and positional relationship.

The people counting may specifically be counting a combination of one or more of the number of accumulated people entering the designated area, the number of accumulated people leaving the designated area, and the dynamic number of people entering the designated area. Specifically, the electronic device may add 1 to the number of statistically accumulated people entering the designated area, and/or add 1 to the dynamic number of people entering the designated area, when one tracked human head position enters the designated area. The electronic device may add 1 to the number of statistically accumulated people leaving the designated area, and/or subtract 1 from the dynamic number of people entering the designated area, when one tracked human head position leaves the designated area.

In this embodiment, the human head detection may be applied to security applications. The people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be improved.

In an embodiment, the human head detection method further includes a step of detecting the human head position and continuing tracking near the human head position tracked in a previous video frame when the tracking of the human head position is interrupted. Referring to FIG. 9, this step specifically includes the following steps:

S902: Track and record the human head position video frame by video frame.

Specifically, the electronic device tracks the detected human head position with the detected human head position in the image to be detected as a starting point, and records the tracked human head position.

S904: Acquire a human head position tracked in a previously recorded video frame if the tracking of the human head position in a current video frame is interrupted.

Specifically, when a person moves quickly or lighting changes, the tracking of the human head position may be interrupted; in this case, the human head position tracked in the previous video frame and recorded during the frame-by-frame tracking is acquired.

S906: Detect human head positions in a local area covering the acquired human head position (in step S904) in the current video frame.

The local area covering the acquired human head position is smaller than the size of one video frame, and larger than the size of the area occupied by the human head position tracked in the previous video frame. A shape of the local area may be similar to a shape of the area occupied by the human head position tracked in the previous video frame. A center of the local area may overlap with a center of the area occupied by the human head position tracked in the previous video frame.

Specifically, the electronic device may detect the human head positions in the current video frame and find those belonging to the local area. The electronic device may also detect the human head positions only in the local area. The electronic device may specifically use steps S302 to S310 to detect the human head positions in the local area in the current video frame. The detected human head positions may be partially or entirely located in the local area. The electronic device may use the human head positions whose centers are within the local area as the human head positions detected in the local area, and the human head positions whose centers are outside the local area do not belong to the human head positions in the local area.

For example, when the human head position is represented by the position of a rectangular box including the human head image, if the width of the rectangular box tracked in the previous video frame is W and its height is H, and a and b are set to coefficients greater than 1, then the local area may be the rectangular area having a width of a*W and a height of b*H and the same center as the rectangular box. If the center coordinates of the rectangular box tracked in the previous video frame are (X1, Y1) and the center coordinates of another rectangular box indicating the human head position are (X2, Y2), then when |X1−X2|<W/2 and |Y1−Y2|<H/2, the rectangular box of which the center coordinates are (X2, Y2) is determined to be in the local area of the rectangular box of which the center coordinates are (X1, Y1).
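The center-based local area test in the example above can be sketched as follows, with boxes represented as (center_x, center_y, width, height); the coefficients a and b are illustrative, and with a = b = 1 the function reduces to the |X1−X2| < W/2 and |Y1−Y2| < H/2 test.

```python
def in_local_area(prev_box, candidate_box, a=1.0, b=1.0):
    """Check whether a candidate head center lies in the local area of a lost track.

    Boxes are (center_x, center_y, width, height). Larger coefficients a and b
    widen the local search area around the previously tracked box.
    """
    x1, y1, w, h = prev_box
    x2, y2, _, _ = candidate_box
    return abs(x1 - x2) < a * w / 2 and abs(y1 - y2) < b * h / 2
```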

S908: Continue to perform step S902 from the human head position detected in the local area.

In this embodiment, when the tracking of the human head positions is interrupted, the human head positions can be detected in the vicinity of the human head position tracked in the previous frame, and the interrupted human head tracking can be recovered and continued. The human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.

The specific principle of the foregoing human head detection method is described below with a specific application scenario. A large number of top view images of an elevator entrance scene are acquired in advance, and the human head positions in these top view images are marked or labeled. For example, a quadruple is used to indicate the position of the human head image in a rectangular box 1001 in FIG. 10. A convolutional neural network for classification is selected, the fully connected layer after the preprocessing layer and before the regression layer is converted to a convolutional layer, and the regression layer therein is replaced with the regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level; the convolutional neural network is then retrained by using the marked top view images.

Referring to FIG. 11, in actual application, if the number of people entering and exiting a gate needs to be counted, a top view camera is disposed above the gate, and the videos are captured by the top view camera and transmitted to an electronic device connected to the top view camera. The electronic device uses an image area sandwiched by a line 1101 and a line 1104 in one of the video frames as an image to be detected, and segments the image to be detected into one or more sub-images. Each sub-image is input to a convolutional neural network trained by training images having marked human head positions. The convolutional neural network outputs the human head positions corresponding to each sub-image and the confidence levels corresponding to the human head positions; the electronic device then filters, according to the corresponding confidence levels, the human head positions corresponding to each sub-image, and acquires the human head positions detected in the image to be detected.

Further, the electronic device performs human head tracking video frame by video frame according to the human head position detected in the image to be detected, and it is determined that a tracked human head position 1105 enters a designated area when the tracked human head position 1105 sequentially crosses a first line 1102 and a second line 1103 parallel with the first line 1102. When a tracked human head position 1106 sequentially crosses the second line 1103 and the first line 1102, it is determined that the tracked human head position 1106 leaves the designated area. The designated area in FIG. 11 may specifically be the area sandwiched by the second line 1103 and the line 1104.

In an embodiment, an electronic device is further provided, and an internal structure of the electronic device may be shown in FIG. 2. The electronic device includes a human head detection apparatus. The human head detection apparatus includes various modules, and the modules may be all or partially implemented by software, hardware or a combination thereof.

FIG. 12 is a structural block diagram of a human head detection apparatus 1200 according to an embodiment. Referring to FIG. 12, the human head detection apparatus 1200 includes a segmenting module 1210, a convolutional neural network module 1220, and a human head detection result determining module 1230.

The segmenting module 1210 is configured to segment an image to be detected into one or more sub-images.

The convolutional neural network module 1220 is configured to input each sub-image to a convolutional neural network trained according to training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through the convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; and map, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position.

The human head detection result determining module 1230 is configured to filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.

According to the human head detection apparatus 1200, the convolutional neural network is trained in advance based on the training image having the marked human head position, and the convolutional neural network can automatically learn human head features. The trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and then filter, according to the confidence levels, to acquire the human head position in the image to be detected. The human head shape is not required to be assumed in advance, a missed detection caused by setting the human head shape can be avoided, and the accuracy of the human head detection is improved. Moreover, in the convolutional neural network, the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe human head features in the sub-images. Therefore, the second features are directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with a new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.

In an embodiment, the segmenting module 1210 is further configured to segment the image to be detected into one or more sub-images of a fixed size, and adjacent sub-images in the segmented sub-images have an overlapping part. In this embodiment, there is an overlapping part between the adjacent segmented sub-images, thereby ensuring that the adjacent sub-images have stronger correlation, and improving accuracy of detecting a human head position from the image to be detected.

As shown in FIG. 13, in an embodiment, the human head detection apparatus 1200 further includes a convolutional neural network adjusting module 1240 and a training module 1250.

The convolutional neural network adjusting module 1240 is configured to convert a fully connected layer after the preprocessing layer and before the regression layer included in the convolutional neural network for classification to a convolutional layer; and replace a regression layer in the convolutional neural network for classification with a regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level.

The training module 1250 is configured to train the convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer by using the training image having the marked human head position.

In this embodiment, the training after reconstruction is performed based on the convolutional neural network for classification, to acquire the convolutional neural network for human head detection. A complete redesign of the convolutional neural network is not required, so the training duration can be reduced and the efficiency of human head detection is improved.

In an embodiment, the convolutional neural network module 1220 is further configured to map, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image; and map, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a confidence level corresponding to the output human head position.

As shown in FIG. 14, in an embodiment, the human head detection result determining module 1230 includes a filtering module 1231 and a human head position determining module 1232.

The filtering module 1231 is configured to screen, from the human head positions corresponding to the sub-images, to acquire a human head position corresponding to a confidence level greater than or equal to a confidence level threshold; and select a human head position intersecting with the acquired human head position in the image to be detected from the human head positions corresponding to the sub-images.

The human head position determining module 1232 is configured to determine, according to the acquired human head position and the selected human head position, the human head position detected in the image to be detected.

In this embodiment, the accuracy of the human head detection can be further improved by using the confidence levels and the position intersection as the basis for determining the human head position in the image to be detected.

In an embodiment, the human head position determining module 1232 is further configured to use the acquired human head position and the selected human head position as nodes in a bipartite graph; assign default and positive weights to edges between the nodes in the bipartite graph; reduce the correspondingly assigned weights when the human head positions indicated by the nodes associated with the edges intersect; and solve a maximum weight edge combination of the bipartite graph, to acquire the head position detected in the image to be detected.

In this embodiment, since the intersecting human head positions are likely to correspond to the same human head, the human head positions output by the convolutional neural network are mostly gathered near the actual human head position in the image to be detected. Therefore, the acquired human head positions and the selected human head positions are used as nodes in the bipartite graph to construct the bipartite graph, and the weights of the corresponding edges of the intersecting human head positions are relatively small. By solving the maximum weight edge combination, the human head positions detected in the image to be detected are acquired, and the human head detection can be performed more accurately.

As shown in FIG. 15, in an embodiment, the image to be detected is a video frame in a video. The human head detection apparatus 1200 further includes:

a tracking module 1260, configured to perform head tracking video frame by video frame according to the human head position detected in the image to be detected;

a counting condition detecting module 1270, configured to determine a moving direction and a positional relationship of the tracked human head position relative to the designated area; and

a people counting module 1280, configured to perform people counting based on the determined moving direction and positional relationship.

In this embodiment, the human head detection is applied to the field of security. The people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be ensured.

In an embodiment, the counting condition detecting module 1270 is further configured to: determine that the tracked human head position enters the designated area when the tracked human head position sequentially spans a first line and a second line parallel with the first line; and determine that the tracked human head position leaves the designated area when the tracked human head position sequentially spans the second line and the first line.
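
By way of illustration and not limitation, the two-line rule can be sketched in Python as follows, assuming the two parallel lines are horizontal, each tracked head is represented by the y coordinate of its centre in successive video frames, and at most one line is crossed between consecutive frames.

    def crossing_direction(y_history, line1_y, line2_y):
        # y_history: centre y coordinates of one tracked head, oldest first.
        # Returns "enter" when line 1 is crossed before line 2, "leave" for the
        # opposite order, and None if both lines have not yet been crossed.
        crossed = []
        for prev, curr in zip(y_history, y_history[1:]):
            for name, line_y in (("line1", line1_y), ("line2", line2_y)):
                if (prev - line_y) * (curr - line_y) < 0 and name not in crossed:
                    crossed.append(name)
        if crossed == ["line1", "line2"]:
            return "enter"
        if crossed == ["line2", "line1"]:
            return "leave"
        return None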

In this embodiment, the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by two lines, thereby preventing a judgment error caused by the movement of the human head position near a boundary of the designated area and ensuring the correctness of people counting.

As shown in FIG. 16, in an embodiment, the human head detection apparatus 1200 further includes a human head position acquiring module 1290.

The tracking module 1260 is further configured to track and record the human head position video frame by video frame.

The human head position acquiring module 1290 is configured to acquire a human head position tracked in a previously recorded video frame if the tracking of the human head position in a current video frame is interrupted.

The convolutional neural network module 1220 is further configured to detect human head positions in a local area covering the acquired human head position in the current video frame.

The tracking module 1260 is further configured to continue to perform the step of tracking and recording the human head position video frame by video frame from the human head positions detected in the local area.

In this embodiment, when the tracking of the human head positions is interrupted, the human head positions can be detected from the vicinity of the human head positions detected in the previous frame, and the interrupted human head tracking can be continued. The human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.
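
By way of illustration and not limitation, the recovery of an interrupted track could be sketched in Python as below. The detect_heads callable stands in for the convolutional neural network described above, the frame is assumed to be a NumPy-style image array, and the window margin is an illustrative value.

    def recover_track(last_box, frame, detect_heads, margin=40):
        # last_box: (x1, y1, x2, y2) of the head tracked in the previous frame.
        # Only a local area around that box is searched in the current frame.
        x1, y1, x2, y2 = last_box
        h, w = frame.shape[:2]
        rx1, ry1 = max(0, x1 - margin), max(0, y1 - margin)
        rx2, ry2 = min(w, x2 + margin), min(h, y2 + margin)
        crop = frame[ry1:ry2, rx1:rx2]
        detections = detect_heads(crop)
        # Shift detections back into full-frame coordinates so that tracking and
        # recording can continue video frame by video frame.
        return [(bx1 + rx1, by1 + ry1, bx2 + rx1, by2 + ry1)
                for (bx1, by1, bx2, by2) in detections]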

It should be understood that the steps in various embodiments of this application are not necessarily performed in an order indicated by the step numbers. Unless explicitly described in this specification, there is no strict sequence for execution of the steps. In addition, at least some steps in the embodiments may include a plurality of substeps or a plurality of stages. The substeps or the stages are not necessarily performed at a same moment, and instead may be performed at different moments. The substeps or the stages are not necessarily performed in sequence, and instead may be performed in turn or alternately with another step or with at least some of the substeps or stages of the another step.

A person of ordinary skill in the art may understand that all or some of the processes of the methods in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium. When the program runs, the processes of the foregoing methods in the embodiments are performed. The memory, storage, database or any other media in the embodiments provided in this application may include a non-volatile and/or volatile memory. The non-volatile memory may include: a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may include a random access memory (RAM) or an external cache memory. By way of illustration and not limitation, the RAM may be implemented in various forms, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), a rambus direct RAM (RDRAM), a direct rambus dynamic RAM (DRDRAM), and a rambus dynamic RAM (RDRAM).

Various technical features in the foregoing embodiments may be combined in any manner. For ease of description, not all possible combinations of the various technical features in the foregoing embodiments are described. However, the combinations of the technical features should be considered as falling within the scope recorded in this specification as long as the combinations of the technical features are compatible with each other.

The foregoing embodiments describe only several implementations of this application, which are described specifically and in detail, but they cannot therefore be construed as a limitation to the patent scope of the present disclosure. It should be noted that a person of ordinary skill in the art may make various changes and improvements without departing from the ideas of this application, which shall all fall within the protection scope of this application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

What is claimed is:
 1. A method for detecting human head in an image performed by an electronic device comprising a processor, the method comprising: segmenting, by the electronic device, the image into one or more sub-images; inputting, by the electronic device, each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image; mapping, by the electronic device through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; mapping, by the electronic device through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filtering, by the electronic device according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
 2. The method according to claim 1, wherein segmenting, by the electronic device, the image into one or more sub-images comprises: segmenting, by the electronic device, the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.
 3. The method according to claim 1, wherein: a fully connected layer in a conventional convolutional neural network is converted to the second convolutional layer; a conventional regression layer in a conventional convolutional neural network for image classification is replaced by the regression layer for mapping the second feature output by the second convolutional layer to the human head position and the corresponding confidence level; and the method further comprises training, by the electronic device, the convolutional neural network comprising the preprocessing layer, the second convolutional layer, and the regression layer by using the training images having the marked human head positions.
 4. The method according to claim 1, wherein mapping, by the electronic device through the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position comprises: mapping, by the electronic device through a third convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and mapping, by the electronic device through a fourth convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the human head position.
 5. The method according to claim 1, wherein filtering, by the electronic device according to the corresponding confidence level, the human head positions corresponding to the one or more sub-images, to acquire the detected human head positions in the image comprises: screening, by the electronic device from the human head positions corresponding to the one or more sub-images, to acquire screened human head positions corresponding to confidence levels greater than or equal to a predetermined confidence level threshold; selecting, by the electronic device, human head positions intersecting with the screened human head positions from the screened human head positions to obtain overlapped human head positions; and determining, by the electronic device according to the screened human head positions and the overlapped human head positions, the detected human head positions of the image.
 6. The method according to claim 5, wherein determining, by the electronic device according to the screened human head positions and the overlapped human head positions, the detected human head positions of the image comprises: using, by the electronic device, the screened human head positions and the overlapped human head positions as nodes in a bipartite graph; assigning, by the electronic device, default and positive weights to edges between the nodes in the bipartite graph; reducing, by the electronic device, weights of edges in the bipartite graph associated with the overlapped human head positions; and solving, by the electronic device, a maximum weight edge combination of the bipartite graph to obtain the detected human head positions of the image.
 7. The method according to claim 1, wherein the image comprises a video frame in a video, and the method further comprises: performing, by the electronic device, human head tracking according to the detected human head positions video frame by video frame; determining, by the electronic device, a moving direction and a positional relationship of each of the tracked human head positions relative to a designated area; and performing, by the electronic device, people counting according to the moving direction and positional relationship of each of the tracked human head positions.
 8. The method according to claim 7, wherein determining, by the electronic device, the moving direction and the positional relationship of the tracked human head position relative to the designated area comprises: determining, by the electronic device, that the tracked human head position enters the designated area when the tracked human head position sequentially crosses a first line and a second line parallel with the first line; and determining, by the electronic device, that the tracked human head position leaves the designated area when the tracked human head position sequentially crosses the second line and the first line.
 9. The method according to claim 7, wherein the method further comprises: tracking and recording, by the electronic device, the detected human head positions video frame by video frame; acquiring, by the electronic device, a human head position tracked in a previous video frame if the tracking of the human head position in a current video frame is interrupted; detecting, by the electronic device, a recovered human head position in the current video frame within a local area covering the acquired human head position in the previous video frame; and continuing, by the electronic device, tracking and recording the recovered human head position video frame by video frame.
 10. An electronic device for detecting human head in an image, comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions, when executed by the processor, causing the processor to perform the following steps: segmenting the image into one or more sub-images; inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image; mapping, through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
 11. The electronic device according to claim 10, wherein segmenting the image into one or more sub-images comprises: segmenting the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.
 12. The electronic device according to claim 10, wherein: a fully connected layer in a conventional convolutional neural network is converted to the second convolutional layer; a conventional regression layer in a conventional convolutional neural network for image classification is replaced by the regression layer for mapping the second feature output by the second convolutional layer to the human head position and the corresponding confidence level; and the computer readable instructions further cause the processor to perform the step of training the convolutional neural network comprising the preprocessing layer, the second convolutional layer, and the regression layer by using the training images having the marked human head positions.
 13. The electronic device according to claim 10, wherein mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position comprises: mapping, through a third convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and mapping, through a fourth convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the human head position.
 14. The electronic device according to claim 10, wherein filtering, according to the corresponding confidence level, the human head positions corresponding to the one or more sub-images, to acquire the detected human head positions in the image comprises: screening, from the human head positions corresponding to the one or more sub-images, to acquire screened human head positions corresponding to confidence levels greater than or equal to a predetermined confidence level threshold; selecting human head positions intersecting with the screened human head positions from the screened human head positions to obtain overlapped human head positions; and determining, according to the screened human head positions and the overlapped human head positions, the detected human head positions.
 15. The electronic device according to claim 14, wherein determining, according to the screened human head positions and the overlapped human head positions, the detected human head positions in the image comprises: using the screened human head positions and the overlapped human head positions as nodes in a bipartite graph; assigning default and positive weights to edges between the nodes in the bipartite graph; reducing weights of edges in the bipartite graph associated with the overlapped human head positions; and solving a maximum weight edge combination of the bipartite graph to obtain the detected human head positions in the image.
 16. The electronic device according to claim 10, wherein the image comprises a video frame in a video; and the computer readable instructions further cause the processor to perform the following steps: performing human head tracking according to the detected human head positions video frame by video frame; determining a moving direction and a positional relationship of each of the tracked human head positions relative to a designated area; and performing people counting according to the moving direction and positional relationship of each of the tracked human head positions.
 17. The electronic device according to claim 16, wherein determining the moving direction and the positional relationship of the tracked human head position relative to the designated area comprises: determining that the tracked human head position enters the designated area when the tracked human head position sequentially crosses a first line and a second line parallel with the first line; and determining that the tracked human head position leaves the designated area when the tracked human head position sequentially crosses the second line and the first line.
 18. The electronic device according to claim 16, wherein the computer readable instructions further cause the processor to perform the following steps: tracking and recording the detected human head positions video frame by video frame; acquiring a human head position tracked in a previous video frame if the tracking of the human head position in a current video frame is interrupted; detecting a recovered human head position in the current video frame within a local area covering the acquired human head position in the previous video frame; and continuing tracking and recording the recovered human head position video frame by video frame.
 19. A non-volatile storage medium for storing computer readable instructions, the computer readable instructions, when executed by one or more processors, causing the one or more processors to perform human head detection in an image by the following steps: segmenting the image into one or more sub-images; inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image; mapping, through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
 20. The non-volatile storage medium according to claim 19, wherein segmenting the image into one or more sub-images comprises: segmenting the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.